Introduction


This  final  report  summarizes  activities  conducted  under  a contract 
to  develop  a Criterion-Referenced  Test  (CRT)  Construction  Manual.  Major 
objectives  accomplished  by  the  project  were  the  preparation  of  a written 
review  of  the  literature  on  Criterion-Referenced  Testing,  identification 
of  needed  research  to  help  achieve  a more  consistent,  unified  criterion- 
referenced  test  model,  and  development  of  an  easy-to-use,  "how-to-do-it" 
manual  to  assist  Army  test  developers  in  the  construction  of  CRTs. 

In  order  to  accomplish  these  objectives,  the  project  encompassed 
the  following  activities: 

1.  A survey  of  the  literature  on  criterion-referenced  testing 
conducted  in  order  to  provide  an  information  base  for  devel- 
opment of  the  CRT  Construction  Manual. 

2.  Visits  to  selected  Army  posts  to  review  the  present  status  of 
criterion-referenced  test  construction  and  application  in  the 
Army.  Interviews  conducted  during  these  visits  provided  in- 
formation which  aided  in  making  the  CRT  Construction  Manual 
practicable  and  useful,  and  in  identifying  problems  with 
criterion-referenced  testing  that  require  further  research. 

3.  Preparation  of  an  interim  report  based  upon  the  first  two 
tasks  and  upon  review  by  experts  in  the  criterion-referenced 
testing  field. 

4.  Preparation  of  a draft  CRT  Construction  Manual. 

5.  Revision  of  the  draft  manual,  based  upon  feedback  from  expert 
reviews. 

6.  Conduct  of  a field  review  of  the  revised  manual,  in  which 
selected  Army  personnel  used  the  revised  manual  to  construct 
CRTs.  These  personnel  completed  evaluation  packages  during 
the  field  review  indicating  the  utility  of  the  manual  and 
problems encountered with its use. In addition, other Army
personnel functioning in supervisory capacities also reviewed
the manual.

7.  Final  revision  of  the  CRT  Construction  Manual,  based  upon  the 
findings  of  the  field  review. 

This  report  fulfills  the  contract  requirements  for  a final  report 
summarizing  project  activities. 




Executive  Summary 


Part  1 of  this  report  describes  procedures  used  for  reviewing  the 
technical  and  theoretical  literature  in  the  areas  of  criterion-referenced 
testing.  Sources  of  the  literature  reviewed,  search  strategies,  and 
topics  covered  are  described. 

Part  2 summarizes  positions  on  theoretical  and  technical  aspects 
of  CRT  construction  and  use,  based  upon  the  state-of-the-art  of  criterion- 
referenced  testing  as  reflected  in  the  literature  review.  These  posi- 
tions were  used  as  the  bases  for  the  procedures  presented  in  the  CRT 
Construction  Manual  developed  during  this  project. 

Part  3 is  a brief  summary  of  the  methodology  used  to  survey  the 
application  of  criterion-referenced  testing  techniques  in  the  Army.  In- 
formation was  collected  to  supplement  the  literature  search  and  review, 
to  provide  detailed  material  on  current  CRT  development  and  use  in  the 
Army,  and  to  obtain  information  concerning  attitudes  on,  and  opinions 
about,  criterion-referenced  measurement,  held  by  Army  testing  personnel. 
Listed  in  this  section  are  topics  covered  by  the  survey.  Development 
of  the  Interview  Protocol  used  in  the  survey  is  also  described,  along 
with  a quick  overview  of  the  various  types  of  personnel  who  participated 
in the survey.

Part  4 presents  a summary  and  discussion  of  the  results  from  the 
field  survey  of  CRT  development  and  use  in  the  Army.  General  patterns 
in  test  construction  processes  which  became  apparent  during  the  survey 
are  discussed.  Results  of  the  survey  are  indicated  through  an  analysis 
of  quantitative  data  collected  during  interviews,  and  through  a discus- 
sion of  qualitative  comments,  opinions,  and  anecdotal  information  re- 
corded during  the  interviews.  Problems  observed  in  the  development  and 
use  of  CRTs  by  the  survey  teams  are  described,  and  areas  where  changes 
may  prove  beneficial  to  the  Army  are  mentioned. 

Part 5 describes the development of the CRT Construction Manual.
Objectives  on  which  the  manual  is  based  are  listed,  and  review  and  re- 
vision procedures  are  discussed. 

Part 6 describes the way in which the revised draft CRT Construction
Manual  was  evaluated  in  the  field,  and  the  results  of  the  field  evalua- 
tion. Additionally,  this  section  presents  a discussion  of  the  field 
evaluation  findings,  in  terms  of  implications  for  further  refinement  of 
the manual, Developing Criterion-Referenced Tests.

Part 7 presents recommendations for future research on, and
implementation of, criterion-referenced measurement in Army applications.

Appendix A presents the final version of the Interview Protocol
used in the Army CRT survey, while Appendices B and C are reproductions
of materials used in the field evaluation of the CRT Construction Manual.
Appendix D consists of tallies of the data received from the field
evaluation, and median response values.


Part  1 


Procedure  for  Reviewing  the  Literature 
on  Criterion-Referenced  Testing 


During  conduct  of  this  project,  ASA  reviewed  the  technical  and 
theoretical  literature  on  criterion-referenced  testing.  The  starting 
point  for  this  literature  search  was  a data  base,  developed  by  ASA, 
consisting  of  approximately  2,700  abstracts  and  evaluations  of  journal 
articles,  technical  reports,  military  training  literature,  and  books 
on instructional system development, including criterion-referenced
testing.  During  the  development  of  this  data  base,  nearly  12,000  docu- 
ments were  reviewed,  and  the  most  relevant  were  abstracted  and  evalu- 
ated. Journals  reviewed  included  the  American  Educational  Research 
Technology,  Journal  of  Educational  Research,  Journal  of  Programmed  In- 
struction, Psychological  Record,  and  many  others,  most  of  which  were 
searched  as  far  back  as  1952. 

The  data  base  additionally  included  sources  identified  by  several 
computer  searches,  including  an  ERIC  search,  two  DDC  searches,  a pack- 
aged MEDLARS  search,  and  a search  of  the  HumRRO  KWOC  Index.  All  searches 
used  keywords  such  that  references  pertinent  to  criterion-referenced 
testing  were  likely  to  have  been  captured. 

ASA  used  this  data  base  as  follows: 

1.  The data base was reviewed to select all references directly
relevant  to  criterion-referenced  testing. 

2.  References contained in the literature selected as being
directly relevant were followed up, thereby expanding the data
base of documents concerning criterion-referenced testing.

3.  Additional  educational  literature  not  covered  adequately  dur- 
ing the  creation  of  the  original,  instructional  system  devel- 
opment data  base,  was  reviewed,  and  appropriate  documents  were 
added  to  the  criterion-referenced  testing  data  base. 

4.  All documents in the criterion-referenced testing data base
were  reviewed,  and  important  points  on  methodology,  results, 
and  critiques  were  documented  in  a cross-referenced  index 
file. 

5.  A review of the literature was prepared, based on the
cross-referenced index.


Part 2


Brief  Summary  of  the  State-of-the-Art  In 
Criterion-Referenced  Testing 


The purpose of this section is to describe positions on theoretical
and technical aspects of CRT construction and use, based upon the
state-of-the-art of CR testing as reflected in the ASA literature
review (Swezey, Pearlstein, and Ton, 1974). These positions were used
as the bases for the procedures presented in the CRT Construction
Manual. Positions are presented sequentially for the following topics:

1.  Design  considerations  and  CRT  use 

2.  Construction  methodology  and  related  issues 

3.  CRT  administration  and  scoring 

4.  Reliability  and  validity. 

Design  Considerations  and  CRT  Use 

Among  the  major  considerations  in  CRT  construction  is  the  way  in 
which  specific  uses  may  affect  test  design.  Test  design  may  vary  in 
several  related  fundamental  respects,  such  as  the  basis  upon  which  test 
items  are  constructed  and  selected.  In  CR  testing,  items  are  generally 
developed from an analysis of tasks to be performed and from attempts
to operationally define the behaviors required. This is not necessarily
the case in norm-referenced (NR) testing. The manner in which scores
are interpreted and used also differentiates CRTs from NRTs. In CR
testing, scores attained by examinees are interpreted against an
external, absolute standard, as opposed to the distribution of scores
attained by other examinees, which is the case with NRTs.
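
The difference between the two kinds of score interpretation can be
illustrated with a short sketch. The Python fragment below is offered
for illustration only: the function names, the sample scores, and the
standard of 8 are assumptions introduced here and do not come from the
CRT Construction Manual.

```python
from bisect import bisect_left

def criterion_referenced(score, standard):
    """Interpret a score against an external, absolute standard (go/no-go)."""
    return "go" if score >= standard else "no-go"

def norm_referenced(score, group_scores):
    """Interpret a score by its standing among other examinees:
    the percentage of the group scoring strictly below it."""
    ordered = sorted(group_scores)
    below = bisect_left(ordered, score)
    return 100.0 * below / len(ordered)

# Hypothetical scores for ten examinees on a ten-item test.
group = [4, 5, 6, 6, 7, 7, 8, 9, 9, 10]

print(criterion_referenced(7, standard=8))  # no-go
print(norm_referenced(7, group))            # 40.0
```

The same score of 7 is a no-go against the assumed standard, whatever
the other examinees scored. Its norm-referenced standing would change
if the group changed.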

It must first be decided whether a CRT, as opposed to an NRT, is
appropriate. CRT scores do not lend themselves to ordering individuals
along a continuum; thus, if the primary use of test results is to select
among  individuals  for  promotion,  special  honors,  etc.,  CR  testing  is 
contraindicated.  Whenever  information  is  desired  for  purposes  of  com- 
paring examinees,  NR  testing  appears  to  be  more  appropriate  than  CR  test- 
ing. This  applies  to  tests  of  achievement,  knowledge,  and  performance. 

CR  testing  is  usually  the  technique  of  choice  when  evaluations  are 
to  be  made  on  the  basis  of  an  individual's  achievement  of  specific  ob- 
jectives. Here  the  primary  question  of  interest  is:  "How  well  can  an 
individual perform relative to an external standard?", rather than:
"How well does an individual do compared to others?"


Cost Effectiveness


CRTs  may  be  more  expensive  to  develop  and  administer  than  NRTs, 
in  terms  of  absolute  costs.  CRT-specific  development  costs  are  due 
largely  to  the  need  for  carefully  deriving  and  specifying  objectives, 
while  additional  administration  costs  may  result  from  the  necessity  of 
comparing  examinee  performance  to  external  standards.  Nevertheless,  CR 
testing  may  well  be  more  cost-effective  in  the  long  run,  if  there  is  a 
genuine  need  to  ascertain  an  individual's  ability  to  perform  a specific 
task. 

Indirect approaches to criterion-referencing, which correlate symbolic
performance and/or job knowledge test results with performance
measures, may be one way to alleviate the high costs of CRTs.
Such approaches involve the development of two tests at different levels
of fidelity for each objective, and subsequent validation of the
indirect measures against the performance measures. Justification for these
approaches centers on savings in administration time and costs.

Development  of  direct  CRTs  appears  justified,  desirable,  and 
cost-effective,  if  there  is  a need  to  ensure  that  individuals  will  be 
able  to  perform  adequately  on  the  tasks  for  which  they  are  being  trained. 
When  there  is  a need  for  ensuring  minimal,  absolute  levels  of  perfor- 
mance, CR  testing  is  the  approach  of  choice. 


Screening  and  Diagnosis 

CRTs are applicable for use as screening devices in cases where
there  is  a possibility  that  individuals  may  be  able  to  perform  tasks 
without  training.  If  a person  can  achieve  the  criterion  level  on  a CRT, 
he  should  be  able  to  enter  the  job  without  intervening  training.  Simi- 
larly, CRTs  may  be  used  to  determine  the  appropriate  point  in  a train- 
ing cycle  for  an  individual  to  commence  training. 

CRTs  may  also  be  used  as  diagnostic  aids.  Persons  achieving  the 
criterion  level  might  be  channeled  into  advanced  instruction,  or  remedi- 
ation might  be  suggested  for  those  falling  below  criterion  level  on 
certain  objectives.  CR  testing  for  diagnostic  purposes  is  likely  to 
be  more  difficult  and  more  expensive  than  CR  testing  for  achievement 
of  objectives,  because  detailed  documentation  on  the  examinees'  behav- 
ior is  required.  This  may  necessitate  more  examiners  and/or  more 
elaborate  schemes  for  collecting  data. 


Evaluation  of  Instructional  Programs 

Aside from the assessment of individual performance against
absolute standards, CRTs may also be used to evaluate instructional
programs. Here, the primary question of interest is: "Has my instructional
program taught what it is supposed to teach?" NR testing is less
appropriate for such an application than is CR testing, since wide score
ranges  before  and  after  administration  of  the  instructional  program 
are  not  necessarily  germane  to  the  question  of  interest.  CRTs  designed 
for  this  application  are  presumably  based  directly  upon  instructional 
objectives  since  the  basic  question  is  whether  or  not  the  program  has 
successfully  taught  performance  compatible  with  the  instructional  ob- 
jectives. CRTs  thus  provide  data  having  direct  relevance  to  the 
question. 


Construction Methodology and Related Issues


Due to the relative recency of the CR testing concept, many
theoretical and practical aspects of CRT construction methodology are not
so well defined as is the case for NRTs. Additional sophistication in
CRT  construction  methodology  must  await  further  research  on  theoretical 
issues,  and  results  from  more  extensive  attempts  at  CRT  implementation. 
Nevertheless,  some  general  "do's  and  don'ts"  for  CRT  construction  can 
be  extracted  from  the  methodological  literature. 


Task  Analysis 


First, CRT construction requires careful analysis of the tasks
comprising  the  test's  subject.  While  conduct  of  the  task  analysis  it- 
self may  be  outside  the  test  developer’s  domain,  the  test  developer 
must  obtain  analytic  data  on:  (1)  skills  and  knowledges  necessary  for 
task  performance,  (2)  required  performances  stated  in  behavioral  terms, 
(3)  criteria  associated  with  each  identified  performance,  and  (4)  con- 
ditions under  which  the  tasks  must  be  performed. 


Without  these  data,  the  test  developer  cannot  adequately  define 
objectives,  and  consequently  cannot  match  test  items  to  objectives. 
Nor  can  he  ensure  the  content  validity  of  the  test.  If  usable  CRTs 
are  to  be  constructed,  task  analyses  are  necessary  prerequisites. 


Preparing  Objectives 


Preparing objectives is one of the first formal steps in
constructing a CRT. Mager (1962) has documented a useful procedure for
formatting these objectives. Mager’s suggestions for structuring objectives
also appear appropriate. Information to be used in preparing
objectives is best derived from thorough task analytic data.


If  the  test  developer's  input  includes  a list  of  unitary  objectives — 
objectives  covering  separate,  single  tasks — as  is  assumed  in  the  case  of 
the  CRT  test  construction  process  presented  in  the  ASA  Manual,  the  test 
developer's primary task is to match test items to these objectives.
The test developer must assume that objectives are properly matched
to the actual job tasks. If this assumption is violated, the
resulting CRT will lack content validity. If, however, the assumption is
accurate, and the developer properly matches items to objectives, content
validity  will  be  achieved.  Thus,  the  test  developer  must  be  knowledge- 
able about  appropriate  formats  and  quality  standards  for  objectives  in 
order  to  make  an  adequate  assessment  of  their  suitability  for  CRT 
development. 


Matching  Items  to  Objectives 

Mager  (1973)  has  provided  a sound  plan  for  matching  CRT  items  to 
objectives.  Mager’s  plan  involves  matching  performances  and  conditions 
stated  in,  or  implied  by  objectives,  with  corresponding  item  performances 
and conditions. Mager’s plan omits a procedure for matching standards
among objectives and test items, but implies that standards should
also be matched.

The test constructor's task is to create test items that are
congruent with objectives. To the extent that objectives are "fuzzy," the
test  constructor  cannot  create  appropriate  items.  It  is  recommended 
that  he  send  fuzzy  objectives  back  to  their  originator,  annotating  their 
difficulties  and  requesting  a reconsideration. 

When  the  test  developer  has  received  an  adequate  objective  (or 
set  of  objectives)  for  which  a test  is  to  be  constructed,  a number  of 
factors  must  be  considered  before  items  are  matched  to  objectives.  These 
factors  include:  practical  constraints  in  the  testing  situation,  test 
fidelity,  test  format,  and  number  of  items  required  to  test  a given 
objective. 

Practical  constraints  must  be  systematically  assessed  before  test 
items  can  be  constructed  so  that  the  items  can  be  built  with  performance 
indicators  which  are  suitable  for  such  considerations  as:  testing  con- 
ditions, tester  availability,  time  availability,  facility  and  equipment 
availability,  etc.  These  considerations  obviously  impact  on  test  fidel- 
ity. CRT  items  should  be  constructed  at  the  highest  level  of  fidelity 
practicable,  consistent  with  situational  constraints.  In  cases  where 
critical  objectives  are  to  be  tested,  special  care  must  be  taken  to 
develop  sufficiently  high  fidelity  items  so  that  critical  task  mastery 
can  be  accurately  assessed. 


Selecting Among Objectives

The  tactic  of  selecting  among  objectives,  that  is,  randomly  test- 
ing a subset  of  objectives,  may  be  used  in  some  instances,  as  long  as 
trainees do not know the subset to be tested. This tactic must not be
used when critical objectives are involved. For objectives of a
non-critical nature, selection may be used to overcome practical
constraints imposed by the testing situation, without necessitating
modification of objectives. Selection among objectives should never be
done  when  it  is  necessary  to  certify  that  individuals  qualify  on  all 
objectives. 
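
The selection rule described above can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions: the
function name, its parameters, and the sample objectives are
hypothetical, and the text does not prescribe any particular sampling
procedure.

```python
import random

def select_objectives(objectives, n_noncritical, certify_all=False, rng=None):
    """Choose the objectives to be tested.

    objectives    -- dict mapping objective name to True if critical
    n_noncritical -- how many non-critical objectives to sample
    certify_all   -- if True, every objective is tested (no selection)
    rng           -- optional random.Random, e.g. seeded for repeatability
    """
    if certify_all:
        return list(objectives)
    if rng is None:
        rng = random.Random()
    # Critical objectives are always tested; only the rest are sampled.
    critical = [name for name, is_crit in objectives.items() if is_crit]
    noncritical = [name for name, is_crit in objectives.items() if not is_crit]
    k = min(n_noncritical, len(noncritical))
    return critical + rng.sample(noncritical, k)

# Hypothetical objectives; True marks a critical objective.
objectives = {
    "objective A": True,
    "objective B": True,
    "objective C": False,
    "objective D": False,
    "objective E": False,
}
chosen = select_objectives(objectives, n_noncritical=2)
```

Because the non-critical subset is drawn at random for each
administration, trainees cannot know in advance which objectives will
be tested.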


Number  of  Items 

No  hard  and  fast  rules  for  specifying  the  number  of  items  to  be 
created for a given objective exist. It is recommended that as many
items be included as available testing time will permit, within limits
suggested by considerations of motivational and fatigue factors.
As Graham (1974) has noted, "even for highly homogeneous
tests, four or five items may be necessary to minimize classification
errors." Thus, even for CRTs measuring a single, well-specified
objective with few confounding factors, additional items may help to reduce
measurement  error.  For  more  heterogeneous  tests,  the  desirability  of 
having  extra  items  may  be  even  more  pronounced. 


Format 

Test  format  may,  in  many  cases,  be  largely  dictated  by  objectives. 
Certain objectives, for example, may require hands-on performance
testing. Such things as number of items to be included, and practical
constraints such as time and manpower availability, may also help determine
format; e.g., a situational item, multiple-choice format might be the
only feasible way of testing some sets of objectives. A general
guideline might be based on Edgerton's (1974) suggestion that item styles
not  be  mixed  in  the  same  test,  so  as  to  avoid  measuring  "test  taking 
skill"  instead  of  subject  matter  competence. 

Item  generation  rules,  such  as  "item  forms"  and  "facets"  are  not 
yet  sufficiently  researched  to  warrant  use  by  personnel  who  are  not 
sophisticated  in  psychometrics.  Hence,  for  objectives  that  may  be 
tested by an unlimited number of items, such as those dealing with
concepts, the best suggestion that can be offered to testing personnel at
this time is to be sure that each item matches the objective it tests.


Item  Pools 

After the test developer has considered such factors as fidelity,
number of items, etc., items can be matched to objectives using
principles similar to those advanced by Mager (1973). The test developer
should construct a pool of items considerably larger than the number
required for the test, so that the best items can be selected. Items
are then constructed at the level of fidelity and in the format
previously determined.


Item Analysis


Traditional item analysis techniques, like other statistical
techniques developed in conjunction with NR testing, have limited
applicability for CR testing (due to restricted ranges of score variance in
CRTs). Although recent studies have suggested techniques for
increasing variance of CRT scores (e.g., Haladyna, 1973; Woodson, 1973), these
techniques  are  "experimental,"  and  it  is  not  yet  appropriate  to  apply 
them  as  a matter  of  course.  Consequently,  until  additional  research 
develops  and  refines  new  approaches  to  item  analysis  appropriate  for  CR 
testing,  a simple  index  which  relies  on  the  use  of  "masters"  and  "non- 
masters" (e.g.,  those  who  are  beginning  training  and  those  who  have 
completed  training)  appears  to  be  an  appropriate  technique. 

"Masters"  and  "non-masters"  are  tested  and  their  patterns  of  pass 
and fail on the items are recorded. Phi (φ) coefficients are computed using
four-fold  tables  ("master"-"nonmaster,"  pass-fail)  for  each  item.  Good 
items  are  those  which  are  passed  by  "masters"  and  failed  by  "nonmasters." 
Items  are  poor  if  there  is  little  difference  on  pass-fail  patterns  be- 
tween "masters"  and  "nonmasters,"  or  if  more  "nonmasters"  than  "masters" 
pass  them.  Low  or  negative  coefficients  act  as  warning  flags.  Items 
receiving  low  coefficients  should  either  be  thrown  out  or,  at  least, 
reconsidered carefully before inclusion in a CRT. These warning flags
are relevant whether the pool of items is homogeneous or is composed
of items testing several objectives.
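
The index described above can be computed directly from each item's
four-fold table. The sketch below uses the standard formula for the
phi coefficient of a 2 x 2 table; the function name, the sample counts,
and the handling of an empty row or column are assumptions made for
illustration.

```python
from math import sqrt

def phi_coefficient(masters_pass, masters_fail, nonmasters_pass, nonmasters_fail):
    """Phi coefficient for one item's four-fold table.

                   pass   fail
    masters          a      b
    non-masters      c      d

    phi = (a*d - b*c) / sqrt((a+b)(c+d)(a+c)(b+d))
    """
    a, b, c, d = masters_pass, masters_fail, nonmasters_pass, nonmasters_fail
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        # A row or column is empty, so phi is undefined. Returning 0.0
        # (an assumption) makes the item show up as a warning flag.
        return 0.0
    return (a * d - b * c) / denominator

# Hypothetical counts for three items, 20 masters and 20 non-masters each.
good_item = phi_coefficient(18, 2, 4, 16)       # about 0.70
flat_item = phi_coefficient(10, 10, 10, 10)     # 0.0
reversed_item = phi_coefficient(5, 15, 15, 5)   # -0.5
```

An item passed mostly by masters and failed mostly by non-masters
yields a high positive coefficient. The second and third items would
raise the warning flags described above.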

All  items  should  also  be  reviewed  via  peer  evaluation,  subject 
matter  expert  evaluation,  and  by  appropriate  test  evaluation  units. 

Care  must  be  exercised  to  ensure  that  all  objectives  are  represented 
by  the  proper  number  of  items,  as  determined  previously.  Item  balance 
among  disparate  objectives  measured  by  the  same  test  should  be  main- 
tained as  planned. 


CRT Administration and Scoring


Administration


Like all tests, CRTs must be administered under standardized
conditions. CRTs should include accompanying documentation which
specifies: (1) test administration conditions; (2) instructions;
(3) administration procedures (including how to handle questions, how
to check and set up test supplies and equipment, etc.); (4) circumstances
for excusing examinees from the test, due to illness, fatigue, etc.;
(5) environmental circumstances under which test administration should
be cancelled; and (6) scoring procedures.

Test administrators must be trained to follow specifications
precisely. Since specifications will apply to any test, documentation
accompanying a specific CRT need not necessarily be extremely detailed,
except for special requirements such as setting up the test facility
and test scoring.


Scoring 

Test  scoring  procedures  must  be  developed  during  the  test  construc- 
tion process,  since  they  will  generally  vary  as  a function  of  the  type 
of  CRT.  There  are  a number  of  interrelated  decisions  that  must  be  made 
concerning  scoring.  These  include: 

1.  Objectivity  of  scoring 

2.  Process  vs  product  scoring  methods 

3.  Type  of  scoring  (go/no-go,  rating  scales,  etc.) 

4.  Cut-off  points 

5.  Non-interference  vs  assist  methods. 


Objectivity 

Every  attempt  should  be  made  to  maximize  objectivity  in  scoring 
CRTs. In low fidelity tests, such as those using multiple-choice
formats, objectivity is apparent. (Such tests can be computer-scored.)
In higher fidelity CRTs, it is relatively simple to maximize objectivity
for hard-skill subjects; however, soft-skill areas, such as tactics,
leadership, etc., are more difficult to test objectively. To the extent
that  objectivity  is  not  achieved,  reliability  is  attenuated.  Efforts 
must  be  made  to  specify  soft-skill  objectives  precisely,  so  that  ap- 
propriate items  (with  associated  objective  scoring  procedures)  can  be 
prepared.  Even  in  the  best  of  circumstances,  however,  soft-skill  CRTs 
will  probably  have  less  objective  scoring  guides  than  will  tests  of 
hard-skill subjects. One way to maximize objectivity in soft-skill CR
testing  is  to  require  several  raters  to  assess  each  individual.  Inter- 
rater reliability  can  then  be  calculated.  If  low  inter-rater  relia- 
bility is  found  consistently,  the  test  should  be  revised. 
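
The text does not say which inter-rater reliability statistic should
be calculated. As one possibility, the sketch below computes Cohen's
kappa for two raters who each score the same examinees go or no-go.
The choice of kappa, the function name, and the sample ratings are all
assumptions made for illustration.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for the
    agreement expected by chance. 1.0 is perfect agreement; values
    near 0 mean agreement is no better than chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same, non-empty set")
    n = len(rater_a)
    # Proportion of examinees on whom the two raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category rates.
    categories = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # both raters used a single, identical category
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of eight examinees on one soft-skill item.
rater_1 = ["go", "go", "no-go", "go", "no-go", "go", "go", "no-go"]
rater_2 = ["go", "go", "no-go", "no-go", "no-go", "go", "go", "go"]
kappa = cohens_kappa(rater_1, rater_2)   # about 0.47
```

Here the raters agree on six of eight examinees, but the
chance-corrected value is much lower than 0.75. A consistently low
value of this kind would suggest that the test be revised.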


Process-Product

R.  G.  Smith's  (1965)  guidelines  for  determining  process  versus 
product  measurement  appear  adequate,  with  slight  modifications.  That 
is, product measurement is always appropriate if the objective
specifies a product. When a product measure is called for, it should be
incorporated into the objective, and carried over into the test items.
Product  measures  are  called  for  when: 

(a)  the  product  can  be  measured  as  to  presence  or  characteristics 

(b)  the procedure leading to the product can vary without
affecting the product.

Process measurement is indicated when the objective specifies a
required  sequence  of  performances  which  can  be  observed,  and  the  per- 
formance is  as  important  as  the  product.  Process  measurement  is  also 
appropriate  in  cases  where  the  product  cannot  be  measured  for  safety 
or  other  constraining  reasons. 

There may also be situations where both process and product
measurement are appropriate for a given objective. Following are several
examples  of  conditions  that  may  call  for  both  product  and  process 
measurement: 


(a)  Although the product is more important than the process(es)
which  lead  to  its  completion,  there  are  critical  steps  which, 
if  misperformed,  may  cause  damage  to  equipment  or  injury  to 
personnel. 

(b)  The  process  and  product  are  of  similar  importance,  but  it  can- 
not be  assumed  that  the  product  will  meet  criterion  levels. 

(c)  Diagnostic  information  is  needed.  (By  having  process  as  well 
as  product  measures,  information  as  to  why  the  product  does 
not  meet  the  criterion  can  be  obtained.) 

When both process and product measures are obtained for a specific
objective, scoring must follow the criterion specified by the objective.
That  is,  if  the  criterion  specifies  only  a product,  then  process  scores 
should  not  be  used  to  assess  achievement  of  the  criterion. 


Type  of  Scoring 

The  type  of  scoring  system  employed  must  be  appropriate  for  the 
objective.  If  the  objective  specifies  an  action  or  product,  a go/no-go 
scoring  system  should  be  used  (either  the  action  occurs  in  the  proper 
sequence or it does not; either the product results or it does not).
If the objective specifies characteristics of a criterion-level product
or  action,  a rating  scale  or  other  form  of  point  assignment  is  indi- 
cated. Point  assignments  must  be  made  on  an  explicit,  well-defined 
basis  for  each  item.  For  rating  scales,  inter-rater  reliability  must 
be  high.  Point  assignments  must  be  tied  to  criterion  levels  specified 
in  the  objective. 


Cut-Off  Points 


Cut-off levels should reflect mastery of the objective to the extent
required. Since factors other than ability to perform a task
(such as careless errors, measurement errors, etc.) may affect an
individual's score, cut-off levels are often set somewhat below 100
percent. If, for example, an objective calls for multiplication of two
four-digit numbers, the criterion might specify performing 10 such sets
within five minutes, achieving the correct answer in at least eight
cases. Thus, the cut-off score of 8 (below 8 = fail) reflects an
arbitrary definition of mastery. True mastery would require 10 out of 10.
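
The multiplication example can be written out as a scoring rule. The
sketch below simply encodes the criterion stated in the example (10
problems, five minutes, at least eight correct). The function and
parameter names are hypothetical.

```python
def score_multiplication_objective(n_correct, minutes_used,
                                   cutoff=8, time_limit=5.0, n_problems=10):
    """Go/no-go scoring for the example criterion: at least `cutoff`
    of `n_problems` answered correctly within `time_limit` minutes."""
    if not 0 <= n_correct <= n_problems:
        raise ValueError("n_correct must lie between 0 and n_problems")
    met = n_correct >= cutoff and minutes_used <= time_limit
    return "go" if met else "no-go"

print(score_multiplication_objective(8, 4.5))    # go
print(score_multiplication_objective(7, 4.0))    # no-go
print(score_multiplication_objective(10, 5.5))   # no-go
```

The last case shows that the time limit is part of the criterion: a
perfect score obtained after the five-minute limit is still a no-go.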

Graham (1974) has made some valuable suggestions concerning the
setting  of  cut-off  points.  The  cut-off,  basically,  should  discriminate 
masters  from  non-masters.  However,  as  item  domains  become  more  broad, 
more  heterogeneous  item  sets  are  required.  Thus,  the  confounding  in- 
fluence of  skills  and  knowledges  which  are  not  directly  related  to  ob- 
jectives increases.  For  tests  measuring  objectives  having  broad  domains 
(or  several  objectives  with  different  domains)  the  overlap  between  mas- 
tery and  non-mastery  scores  consequently  widens. 

When  little  overlap  occurs  between  mastery  and  non-mastery  scores 
(as  is  the  case  for  tests  measuring  a single  objective  with  a relatively 
restricted  domain)  setting  a cut-off  score  is  less  critical.  The  cut- 
off point  should  reflect  the  standard  specified  by  the  objective,  and 
can  do  so  without  falling  into  the  zone  of  overlap  between  masters  and 
non-masters, since this zone, by definition, is either narrow or
non-existent. On the other hand, if the overlap is wide, the point at which
the cut-off score is set is critical. Wherever the cut-off score is
set,  there  will  be  some  misclassification.  In  such  cases,  there  are 
two  considerations.  First,  objectives  must  be  specified  precisely, 
with  item  domains  as  restricted  as  possible,  in  order  to  narrow  the 
mastery-nonmastery overlap. When achievement of several objectives of
disparate nature is measured by a single test, separate scores for
each objective's item set should be obtained, each with its own cut-off.
However, for end-of-course or end-of-cycle exams which assess high
levels of skill and knowledge integration, a single cut-off may be
set, since what is to be evaluated is a cluster of skills and knowledges
applied in combination.

Second,  costs  of  false  positives  and  false  negatives  must  be  con- 
sidered. If  the  costs  for  false  negatives  are  relatively  high  (e.g., 
manpower  needs  are  critical)  the  cut-off  score  might  justifiably  be 
lowered.  If  the  costs  of  false  positives  are  high,  then  cut-off  scores 
must  remain  high.  In  any  case,  when  performance  on  critical  tasks  is 
tested,  cut-off  points  must  be  kept  high  enough  to  reflect  the  standards 
specified  in  the  objectives  for  those  tasks. 
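
The trade-off between false positives and false negatives can be
sketched as a search for the lowest-cost cut-off. This is an
illustration only: the score lists, the cost weights, and the function
names below are all hypothetical, and the report does not prescribe
this or any other computational procedure.

```python
# Hypothetical try-out scores (0-10) for examinees whose true status is known.
MASTERS = [10, 9, 9, 8, 8, 7, 7, 6]
NONMASTERS = [8, 7, 6, 6, 5, 5, 4, 3]


def misclassification_cost(cutoff, masters, nonmasters,
                           cost_false_neg, cost_false_pos):
    """Total cost when `cutoff` is the minimum passing score."""
    false_negatives = sum(1 for s in masters if s < cutoff)
    false_positives = sum(1 for s in nonmasters if s >= cutoff)
    return false_negatives * cost_false_neg + false_positives * cost_false_pos


def best_cutoff(masters, nonmasters, cost_false_neg, cost_false_pos,
                max_score=10):
    """Lowest-cost cut-off; ties are resolved toward the stricter cut-off."""
    return min(
        range(max_score + 1),
        key=lambda c: (misclassification_cost(c, masters, nonmasters,
                                              cost_false_neg, cost_false_pos),
                       -c),
    )


equal = best_cutoff(MASTERS, NONMASTERS, cost_false_neg=1, cost_false_pos=1)
fp_costly = best_cutoff(MASTERS, NONMASTERS, cost_false_neg=1, cost_false_pos=5)
fn_costly = best_cutoff(MASTERS, NONMASTERS, cost_false_neg=5, cost_false_pos=1)
print(equal, fp_costly, fn_costly)
```

With these assumed data the cut-off moves up when false positives are
weighted more heavily and down when false negatives are. A cost-based
result of this kind would not override the standard specified in the
objective for a critical task.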


Assist  vs  Non-Interference 

In general, a non-interference method of test administration is
preferred over an assist method in CR testing applications. In the
assist method, the examinee is scored no-go for a missed item, corrected,
and then allowed to proceed. A major problem here is that if the
criterion requires an examinee to complete a chain of steps, he should be
tested on his ability to do so. On the job, the examinee will have
to complete the chain of steps correctly, with no help. There are,
however, cases in which an assist scoring technique can be profitably used.
These  involve  uses  of  CR  testing  for  diagnosis.  In  such  cases,  the 
trainee  is  permitted  to  complete  a chain  of  steps  and  given  assistance 
on  those  which  he  cannot  perform  adequately.  He  is  typically  scored 
no-go  for  steps  where  he  is  assisted.  The  record  of  no-go  steps  is  a 
useful  diagnostic  tool — remediation  can  concentrate  on  missed  steps. 

Such  records  may  also  be  useful  for  evaluating  instructional  material, 
especially  if  many  examinees  have  similar  patterns  of  no-go  items. 
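
A record of no-go steps of the kind described above can be summarized
across examinees in a few lines. The sketch below is illustrative
only; the examinee names, step labels, and the 50 percent flagging
threshold are hypothetical and are not drawn from the survey.

```python
from collections import Counter

# Hypothetical records: the steps on which each examinee was scored no-go.
no_go_records = {
    "examinee_1": ["step_3", "step_7"],
    "examinee_2": ["step_3"],
    "examinee_3": [],
    "examinee_4": ["step_3", "step_5"],
}


def no_go_rates(records):
    """Fraction of examinees scored no-go on each step."""
    counts = Counter(step for steps in records.values() for step in set(steps))
    total = len(records)
    return {step: count / total for step, count in counts.items()}


rates = no_go_rates(no_go_records)

# Steps missed by at least half of the examinees (assumed threshold) may
# point to a problem in the instructional material rather than in individuals.
flagged = sorted(step for step, rate in rates.items() if rate >= 0.5)
print(flagged)
```

Each examinee's own list supports individual remediation, while the
flagged steps summarize patterns shared by many examinees.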


Reliability  and  Validity 
Reliability 


Techniques for assessing CRT reliability are, for the most part,
either  not  fully  developed  or  are  based  on  questionable  assumptions. 
(For  example,  see  Livingston,  1972;  Oakland,  1972;  Haladyna,  1974;  and 
Woodson,  1974.)  The  need  for  additional  work  in  the  area  of  CRT  relia- 
bility continues  to  be  a pressing  one. 

A practical solution is to assess test-retest reliability of CRTs,
a procedure which does not depend on internal consistency, and which
increases the variability of test results because of the two test
administrations required. The φ (phi) coefficient is useful for
analyzing the resulting fourfold (first administration-second
administration, pass-fail) data. φ values less than +.50 would indicate
unacceptable test-retest reliability for CRTs.
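
The phi coefficient for a fourfold table can be computed directly from
the four cell counts. The sketch below uses the standard formula,
phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)); the cell counts in the
example are hypothetical and do not come from any Army data.

```python
from math import sqrt


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi coefficient for a fourfold (2 x 2) table laid out as:

                        second: pass   second: fail
        first: pass          a              b
        first: fail          c              d
    """
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator


# Hypothetical results for 50 examinees tested twice.
phi = phi_coefficient(a=30, b=5, c=4, d=11)
print(f"phi = {phi:.2f}")
print("acceptable" if phi >= 0.50 else "unacceptable")
```

The same function applies to validity data if the second
administration is replaced by the outside measure.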


Validity 

Content  validation  is  an  especially  appropriate  method  in  CRT 
applications.  A CRT  is  content  valid  if  the  test  items  are  carefully 
based  on  the  performances,  conditions,  and  standards  specified  in  the 
objectives  and  if  the  test  items  appropriately  sample  objectives.  (Of 
course,  the  objectives  themselves  must  be  sound.)  Thus,  in  most  in- 
stances, careful  test  construction  will,  itself,  enable  the  development 
of  content  valid  CRTs.  However,  in  instances  where  low  fidelity  CRTs 
are  constructed,  it  may  be  more  difficult  to  determine  content  validity, 
since  the  items  are  not  likely  to  be  precisely  matched  to  objectives. 

In such cases, there are two additional types of criterion-related
validation that are well-suited to CRTs: concurrent validity and
predictive validity.

In  determining  concurrent  validity,  CRT  results  are  compared  with 
an  outside  measure  of  the  behaviors  tested  by  the  CRT.  This  outside 
measure must be the best available assessment of performance on the
objective(s) in question. The assessment of concurrent validity
involves individual assessment via the CRT and the outside measure close
together in time (concurrently). φ again is used on the four-fold data
(CRT-other measure, pass-fail).

Predictive  validity  involves  the  same  assumptions.  The  outside 
measure must be an accurate measure of the performance in question, or
the validation will be meaningless. Predictive validity is calculated
the same way, except the outside measure is taken at a later time, i.e.,
when the individuals are actually performing the job for which they've
been trained. The estimate is calculated just as for concurrent
validity.


Part 3


Field  Survey  Methodology* 


A variety  of  Army  installations  were  visited  in  order  to  survey 
the  application  of  criterion-referenced  testing  techniques  in  the  mili- 
tary. Information  was  collected  to  supplement  the  literature  search 
and  review,  to  provide  detailed  material  on  CRT  development  and  use  in 
the  Army,  and  to  obtain  information  on  attitudes  and  opinions  of  Army 
testing  personnel. 

Specifically,  the  survey  gathered  data  on: 

1.  How  CRTs  are  developed  for  Army  applications. 

2.  How  CRTs  are  administered  in  various  Army  contexts. 

3.  How  CRT  results  are  used  in  the  Army. 

4.  Extent  of  criterion-referenced  testing  in  the  Army. 

5.  The  level  of  personnel  who  will  use  the  CRT  Construction  Manual 
developed  during  the  present  project. 

6.  Problems  encountered  by  Army  testing  personnel  in  the  develop- 
ment and  use  of  CRTs. 

7.  Attitudes  of  Army  testing  personnel  toward  the  development  and 
use  of  CRTs. 

8.  Opinions  on  the  probable  future  course  of  criterion-referenced 
testing  in  the  Army. 

9.  Sample  Army  CRTs  and  problems  in  developing  and  using  them. 

An  interview  protocol  was  developed  for  on-site  use  at  Army  posts, 
to  enable  standardized  collection  of  information  pertaining  to  the  topics 
listed above. Development of the protocol included several review phases,
during  which  revised  versions  of  the  protocol  were  prepared.  The  final 
protocol  combined  separate  versions  for  test  constructors,  test  users, 
and  supervisory  personnel;  and  included  several  optional  items  for  use 
in  interviews  with  personnel  who  were  especially  knowledgeable  about 
criterion-referenced testing. Thus, the final protocol had a high degree
of utility, and was flexible with respect to the range of topics
addressed. Using this protocol, interviews were easily tailored to the
ranges of responsibilities, experience, and knowledge possessed by
individual interviewees. Appendix A to this report is a copy of the
final version of the protocol.


* This section is a brief summary of the methodology used to survey CRT
development and use in the Army. For a more detailed description of
the methodology, see Swezey, Pearlstein and Ton, 1974.

The  interview  protocol  was  used  by  ASA  teams  in  a series  of  one- 
on-one  interviews  conducted  during  the  months  of  January,  February, 
and  March,  1974.  Installations  surveyed  during  this  period  included 
the Infantry School at Fort Benning, the Artillery School at Fort Sill,
the  Air  Defense  School  at  Fort  Bliss,  the  Armor  School  at  Fort  Knox, 
and  BCT  and  AIT  units  at  Fort  Ord.  In  addition,  test-related  depart- 
ments were  surveyed  at  each  post.  A total  of  105  individuals  were 
interviewed. 

ASA  survey  teams  spent  three  days  on-site  at  each  post  surveyed. 
Interviews  ranged  in  duration  from  approximately  one-half  to  three  hours 
apiece.  An  average  interview  took  about  one  and  one-half  hours.  Inter- 
view length  was  at  the  interviewer's  discretion,  based  on  the  utility  of 
information  obtained  from  an  interviewee. 

Personnel in several Combat Arms Schools, MOS testing areas, Training
Extension Course (TEC), and Training Center (BCT and AIT) testing
programs  were  interviewed.  Figure  3-1  shows  the  number  and  types  of 
individuals  interviewed  in  each  of  these  categories.  Each  interviewee 
responded  to  most  of  the  protocol  items. 


Figure 3-1: Types of Interviewees in the Field Survey


Responses to protocol items that were easily and meaningfully
quantifiable were tallied, and percentages of various types of personnel
responding  in  specified  ways  were  computed.  Responses  to  other  items 
that  elicited  opinion,  anecdotal,  and  process  data  were  summarized  by 
extracting and comparing verbal descriptions.


Part 4


Field  Survey  Results  and  Discussion* 


Results 

Test  Construction 


Although  details  of  Army  test  construction  processes  vary  widely 
across  and  within  Army  posts,  some  general  patterns  became  apparent 
during  the  field  survey.  These  include  the  following: 

• Test  personnel  (both  developers  and  supervisors)  are  often  also 
involved  in  preparing  objectives,  including  evaluation  standards. 

• Practical  constraints  in  the  testing  situation  are  frequently 
considered  during  test  development. 

• Although  the  majority  of  test  personnel  interviewed  are  involved 
in  the  actual  creation  of  test  items,  only  a minority  create 
item  pools,  i.e.,  write  more  items  than  are  required  for  a single 
form  of  the  test. 

• Item  analysis  techniques  are  not  generally  used  to  select  final 
items  for  tests.  Statistical  item  analysis  techniques  are  al- 
most never  used. 

• Test  reliability  and  validity  are  almost  never  assessed  in  a 
formal  manner,  and  are  rarely  considered  even  informally. 


Test  Administration 

A large  proportion  of  interviewees  in  the  survey  were  involved  in 
administering  tests.  This  is  not  surprising  since  much  test  develop- 
ment is  done  by  school  instructors;  thus,  individuals  who  create  test 
items  also  administer  the  tests  in  their  classes.  It  was  also  found 
that  an  "assist"  method  of  scoring  is  frequently  used.  Test  adminis- 
trators often  find  it  appropriate  to  provide  help  to  individuals  taking 
the test. The assist method is often used in cases where the examinee
could  not  otherwise  complete  the  test  (e.g.,  a checkout  procedure). 


* This section is a summary and discussion of the results from the field
survey of CRT development and use in the Army. A more detailed
compendium of the results is provided in this project's Interim Report
(Swezey, Pearlstein and Ton, 1974).

Fewer than half of the 100 interviewees queried said that they used
go/no-go scoring standards on their tests. This does not imply that
more than half of the individuals in our survey necessarily use
normative scoring standards; instead, many use point scales (some of
which are criterion-referenced) for scoring.

Many cases in which retesting is done as a matter of course were
cited. For example, in BCT, AIT, and other hands-on performance
testing situations, trainees are often given second and third chances to
pass particular performance items. Considerably less than half of the
interviewees questioned said that they were familiar with team
performance testing situations, but many indicated that team performance
testing is often individual evaluation in a team context. The actual
testing of team performance on the Army posts visited is very limited.


Using  Test  Results 

The  survey  found  that  the  most  common  uses  of  test  results,  other 
than  for  evaluation  of  trainee  performance,  are  for  improving  training 
and  for  diagnostic  purposes.  Seventy-two  percent  of  the  interviewees 
questioned  indicated  that  they  use  test  results  for  individual  diagnostic 
purposes. Seventy-three percent of the interviewees questioned
indicated that they use feedback from tests to improve courses. The way in
which this feedback is used varies widely. For example, some senior
instructors  indicated  that  if  many  trainees  from  a particular  instruc- 
tor’s class  perform  poorly  on  certain  parts  of  a test,  they  would  first 
evaluate  the  instructor.  If  several  classes  taught  by  different  in- 
structors scored  poorly  on  a section  of  a test,  the  senior  instructor 
might  review  the  materials  used  in  that  portion  of  the  course.  In  other 
situations,  the  test  itself  is  reviewed  using  feedback  from  the  students. 

Finally,  less  than  two-thirds  of  the  interviewees  questioned  indi- 
cated that  test  results  are  used  to  compare  trainees,  and  that  such  com- 
parisons are  not  made  frequently.  It  is  fortunate  that  comparisons  of 
this  nature  are  not  made  more  often  since  the  process  of  making  indi- 
vidual comparisons  based  upon  test  results  is  a norm-referenced 
application. 


Types  of  Tests 

The  survey  discovered  that  most  tests  (about  88%)  are  either  paper- 
and-pencil  knowledge  tests  or  hands-on  performance  tests,  as  opposed  to 
simulated  performance  or  other  types  of  tests.  According  to  the  inter- 
viewees, paper-and-pencil  knowledge  tests  account  for  nearly  50%  of 
those  created  and  used;  however,  since  many  interviewees  confused 
paper-and-pencil  knowledge  tests  with  paper-and-pencil  performance 
tests,  a more  realistic  estimate  is  that  approximately  25%  of  the  tests 
are paper-and-pencil knowledge-type, and approximately 25% are
paper-and-pencil performance-type (e.g., trajectory computations).

Survey results indicate that nearly three-quarters of the tests
constructed  or  used  are  performance  tests  of  one  sort  or  another.  These 
results  suggest  that  performance  testing  has  become  widespread  in  many 
phases  of  Army  evaluation.  The  survey  also  showed  that  tests  measur- 
ing specific  skill  and  knowledge  requirements,  and  those  used  at  ends 
of  blocks  of  instruction,  account  for  about  70%  of  test  construction 
and  use.  Mid-cycle  tests  and  end-of-course  tests  together  account  for 
less  than  one-quarter  of  the  tests.  According  to  the  interviewees, 
tests  are  well  distributed  throughout  instruction,  thereby  providing 
frequent feedback and the possibility for on-going remediation.


Problems  in  Constructing  and  Using  CRTs 

Over  two-thirds  of  the  interviewees  indicated  that  increased  short- 
term expense  may  be  a problem  in  the  development  and  use  of  CRTs,  but 
that  in  the  long  run,  criterion-referenced  testing  is  less  expensive 
than  is  norm-referenced  testing. 

Many  individuals  in  the  survey  sample  felt  that  time  pressures, 
and  to  a lesser  degree  other  constraints,  often  prevent  successful  con- 
struction and  use  of  tests;  however,  time  pressures  and  other  constraints 
do  not  usually  interfere  with  test  administration  tasks.  Usually,  tests 
are  administered  satisfactorily  despite  time  pressures. 

Interviewee  Attitudes  on  Criterion-Referenced  Testing 

In  general,  interviewees  were  in  favor  of  the  Army  trend  toward 
criterion-referenced  testing.  Eighty-eight  percent  of  the  individuals 
responding felt that criterion-referenced testing should receive high
or  top  priority  in  Army  assessment  programs.  Sixty  percent  felt  that 
criterion-referenced  tests  should  replace  most  or  all  norm-referenced 
tests. 

All  interviewees  felt  that  criterion-referenced  testing  is  prac- 
tical and  useful  in  measuring  job  performance  skills.  No  other  item  on 
the  survey  protocol  elicited  a 100%  positive  response. 


Discussion 


Although  criterion-referenced  testing  is  used  in  today’s  Army, 
many  NRTs  are  in  use  also.  This  is  not  surprising,  since  criterion- 
referenced  testing  is  a relatively  new  concept.  It  was  apparent  from 
the  survey,  however,  that  CRT  use  is  increasing.  School  implementation 
of  criterion-referenced  testing  is  still  in  the  beginning  stages.  Some 
departments  are  making  serious  attempts  to  incorporate  CRTs,  while 
others  are  only  minimally  involved.  Many  employ  criterion-referenced 
terminology, but do not produce true CRTs. This is especially true in
"soft skill" areas, such as tactics and leadership. Most academic
departments within the four combat arms schools surveyed indicated that
many of their tests, especially the written ones, are graded on a curve.

MOS  testing  continues  to  be  primarily  norm-referenced.  While  the 
situational  multiple-choice  items  from  which  MOS  tests  are  composed  may 
have  been  developed  in  a criterion-referenced  fashion  (i.e.,  based  on 
objectives), the items appear suspiciously similar to conventional
knowledge test questions on the surface. The proposed Enlisted
Personnel Management System (EPMS), including the substitution of Skill
Qualification Tests (SQTs) for the present MOS tests, will presumably
rectify this situation.

The CRT concept is being considered for application in Training
Extension Course packages. However, further development and field
testing of the concept in conjunction with TEC is necessary before
implementation of TEC CRTs becomes a reality.

At  Fort  Ord,  California,  CRTs  are  employed  both  in  Basic  Combat 
Training  and  in  Advanced  Individual  Training.  Advanced  Individual 
Training  in  diverse  areas,  such  as  field  wiring  and  food  services,  ap- 
pears to  be  benefiting  from  the  use  of  CRTs.  Preliminary  indications 
are  that  more  soldiers  are  being  evaluated  more  effectively  through  the 
application  of  criterion-referenced  testing. 

In  general,  although  criterion-referenced  testing  is  not  extensive, 
there are many instances of serious attempts being directed at CRT
development and use at the Army installations visited. Implementation
of CRTs at first appeared dramatic. However, many of the personnel
interviewed confused CRTs with "hands-on" performance testing. In order to
be  called  criterion-referenced,  test  items  must  be  matched  to  objec- 
tives which  are  derived  from  valid  performance  data.  This  is  not  the 
case  for  a significant  proportion  of  the  "hands-on"  performance  tests 
presently  used  at  the  sites  surveyed. 

On  the  Army  posts  surveyed,  there  was  much  respect  for  the  utility 
and  practicality  of  criterion-referenced  testing.  Despite  this  high 
regard,  there  was  too  little  rigorous  development  or  application  of 
CRTs.  While  progress  is  being  made  toward  achieving  rigor  in  "hard 
skill"  areas,  especially  in  equipment-related  skills,  attempts  in 
"soft  skill"  areas  are  lacking.  This  is  understandable,  since  genuine 
difficulties  in  specifying  soft  skill  objectives  explicitly  are  often 
encountered. 

Interviewees  at  all  levels  indicated  a need  for  increased  devel- 
opment and  use  of  criterion-referenced  testing  in  the  Army.  Many  of 
those  indicated  that  a simple,  practical  CRT  construction  manual  would 
consequently  be  well  received  at  all  levels  in  test  development  and 
evaluation units.

A number of difficulties in CRT development and use were observed
and/or  described  during  the  survey.  First,  the  development  of  CRTs 
must  be  derived  from  well  specified  objectives  which  are,  in  turn,  the 
results  of  careful  task-analyses.  Unfortunately,  task  analysis  data 
are  not  available  in  many  cases,  and  in  cases  where  they  are  available, 
they  are  often  disregarded. 

The  CRT  survey  suggested  that  practical  constraints  for  task  ob- 
jectives are  usually  assessed  informally.  Frequently,  practical  con- 
straints to  the  testing  situation  are  considered  only  as  an  afterthought. 
Constraints  which  operate  in  the  testing  situation  should  rightfully  be 
considered while a test is being developed. Some Soldier's Manual Army
Testing (SMART) books, for example, show a minimal regard for practical
testing  constraints.  They  contain  lengthy  checklists  which,  although 
possibly  of  use  in  evaluating  an  individual's  performance,  cannot  be 
followed  by  test  administrators.  The  problem  of  failing  to  consider 
practical  testing  constraints  adequately  may  be  solved  by  training  test 
developers  to  consider  such  factors  as  an  integral  part  of  the  test 
development  process. 

Test  developers  seem  to  have  little  difficulty  creating  items  if 
performances,  standards,  and  conditions  are  accurately  specified  in 
the  objectives.  However,  many  Army  test  developers  surveyed  indicated 
that  they  wrote  only  the  precise  number  of  items  required  for  a specific 
test. Rarely are extra items written. Items are typically reviewed by
subject  matter  experts  and/or  test  evaluation  personnel,  and  are  then 
revised.  Accordingly,  there  is  no  empirical  selection  process  for  final 
test  items. 

Creating  a test  item  pool  should  become  a standard  part  of  the 
test  development  process.  If  twice  as  many  items  are  developed  as  are 
needed  for  a specific  test,  the  test  can  be  tried  out  and  the  final 
items  selected  empirically. 

A poorly administered test defeats long hours of careful test
development. The CRT survey indicated that a lack of standardized testing
conditions exists in many areas. Careful instructions in test
administration are necessary to insure accurate testing. Steps should be taken
to  insure  that  test  administration  practices  are  clearly  defined  for 
each  test,  and  that  test  administrators  are  adequately  trained. 

Finally,  a major  omission  in  the  development  of  CRTs,  as  observed 
during  the  Army  survey,  is  the  lack  of  test  evaluation.  There  was  vir- 
tually no  consideration  of  test  reliability  and/or  validity,  although 
a small subset of interviewees stated that they considered content
validity.  Army  test  developers  should  be  instructed  in  techniques  for 
establishing  reliability,  and  both  content  and  empirical  validities  of 
CRTs.  Even  if  a test  evidences  content  validity  as  a function  of  care- 
ful creation  based  upon  task  objectives,  reliability  is  still  in 
question.


Part 5


Developing  the  CRT  Construction  Manual 


ASA  began  development  of  the  CRT  construction  manual  by  consider- 
ing both  the  information  on  the  state-of-the-art  gained  by  the  litera- 
ture review,  and  the  information  on  Army  testing  needs,  as  determined 
by the field survey. Based on these considerations, a content outline
of the manual was prepared. This outline was submitted for review as
a part of the project Interim Report (Swezey, Pearlstein, and Ton,
1974). Feedback on the proposed contents was obtained from the COTR
and his staff, Army Post Educational Advisors, and other reviewers.
The outline was revised according to these inputs.

In  order  to  produce  a document  presenting  "how-to-do-it"  procedures 
for  the  construction  of  CRTs,  which  would  be  easily  understandable  by 
officers  and  senior  enlisted  personnel  who  have  little  background  in 
psychometrics,  the  manual  was  prepared  in  accordance  with  the  follow- 
ing objectives: 

1.  Careful  structuring  to  present  one  point  at  a time.  Each 
point  should  involve  one,  "how-to-do-it"  operation. 

2.  Clear,  concise,  and  straightforward  text.  Everyday  terminology 
should be used whenever possible, rather than specialized
terminology. When psychometric terms were used, they were
introduced as needed in an operation. A glossary of psychometric
terminology  was  also  included. 

3.  Practical  examples  drawn  from  real  life  Army  situations  were 
used,  in  lieu  of  abstract  discussions.  Theoretical  discussions 
were avoided entirely.

The  initial  draft  of  the  CRT  construction  manual  required  four 
calendar  months  to  prepare.  Following  its  preparation,  it  was  reviewed 
by a number of individuals including the COTR and his staff,
representatives of the Combat Arms Training Board, representatives of
Florida State University's Center for Educational Technology, Dr. Harold
Edgerton (consulting for ASA), and a psychometric consultant selected by
the COTR.


These  reviewers  carefully  examined  the  draft  manual  for  both  con- 
tent and  structure,  and  submitted  suggested  revisions  to  ASA  over  a two- 
month  period.  ASA  collated  the  suggestions,  resolved  conflicting  sug- 
gestions, and,  after  a thorough  in-house  editorial  review,  revised  the 
draft manual. The revised draft manual, entitled Developing
Criterion-Referenced Tests (Swezey and Pearlstein, 1974), was printed and
distributed for field try-outs and reviews.


Part 6


Field  Review  Methodology,  Results  and  Discussion 


The  purpose  of  this  section  is  to  describe  the  way  in  which  the 
CRT Construction Manual was evaluated in the field, and the results of
the field evaluation. Additionally, this section presents a discussion
of  the  field  evaluation  findings,  in  terms  of  implications  for  further 
refinement  of  the  manual. 


Methodology 

Two  versions  of  a field  review  package  for  use  in  evaluating  the 
CRT Construction Manual were prepared. One version (Form 1) was
designed for use by Army test construction personnel, while the other
(Form 2) was designed for use by Army educational advisors and by
personnel in Army test evaluation units. Both versions included an
explanatory cover letter and an evaluation form. Evaluation Form 1 was
intended  to  summarize  the  utility  of  the  manual,  as  evaluated  by  test 
construction  personnel  who  actually  created  CRTs  using  the  manual  step- 
by-step.  Form  2 was  intended  to  summarize  the  manual’s  suitability  for 
the  target  population,  as  assessed  by  Army  test  and  education  experts 
who  read  the  manual  in  detail.  Both  evaluation  forms  consisted  of  two 
sections:  The  first  asked  for  demographic  and  background  data  on  the 
respondent,  while  the  second  consisted  of  35  statements  concerning 
specific aspects of the manual. Respondents were asked to indicate
their  level  of  agreement  with  each  statement.  Respondents  were  also 
requested  to  include  additional  comments  to  elaborate  on  their  evalua- 
tions, as  necessary.  Evaluators  using  Form  1 packages  were  asked  to 
send  copies  of  CRTs  developed  in  conjunction  with  the  manual.  Appen- 
dix B to  this  report  presents  copies  of  both  versions  of  the  field  re- 
view packages. 

Field  review  packages  were  distributed  to  the  following  Army 
installations: 

1.  Combat  Arms  Training  Board,  Fort  Benning,  Georgia 

2.  Infantry  School,  Fort  Benning,  Georgia 

3.  Air  Defense  School,  Fort  Bliss,  Texas 

4.  Armor  School,  Fort  Knox,  Kentucky 

5.  Signal  School,  Fort  Gordon,  Georgia 

6.  Basic Combat Training Unit, Fort Ord, California

7.  Artillery  School,  Fort  Sill,  Oklahoma. 

Four  copies  of  the  Form  1 package  and  three  copies  of  the  Form  2 pack- 
age were  distributed  to  each  of  the  above  installations.  One  person  at 
each Army post (chosen on the basis of familiarity with on-post test
construction personnel, test evaluation personnel, and educational
advisors) was instructed on distribution of the field review packages at
his  installation.  These  persons  were  also  sent  cover  letters  summariz- 
ing the  distribution  procedure.  A copy  of  this  cover  letter  is  included 
as  Appendix  C to  this  report. 

One  copy  of  the  field  review  package  was  also  sent  to  Fort  Benja- 
min Harrison,  Indiana. 

All  facilities  had  approximately  one  month  during  which  to  use, 
review,  and  evaluate  the  CRT  Construction  Manuals.  Follow-up  telephone 
calls were made to respondents whose comments required clarification.
In addition, a field visit to Fort Gordon, Georgia was made to observe
the field review at the Signal School.


Results  and  Discussion 


Figure  6-1  shows  the  number  of  evaluation  forms  returned.  A total 
of 38 respondents submitted field evaluation forms. Figure 6-2
summarizes the respondents as to version of the evaluation form used, rank
or title, and position. Test construction experience of Form 1 users
ranged  from  6 months  to  25  years,  with  a mean  of  6.5  years.  Form  2 
users’  experience  with  test  construction  ranged  from  2 years  to  35 
years,  with  a mean  of  16.2  years. 

Figure 6-3 shows Form 1 users’ responses (in terms of percentages)
to questions 3, 4, and 5. The sample of Form 1 users had high
familiarity with CRTs, but many more had developed CRTs than had used those
developed by others.

Figure 6-4 shows the percent of types of responses to questions 3,
4a, and 4b of Form 2 users. The responses to question 3 showed a similar
pattern to the equivalent question (question 4) on Form 1. Interestingly,
questions 4a and 4b indicate that the personnel in the sample
using Form 2 were often consulted by people having difficulty with CRTs,
and that they felt the CRT manual would have been helpful for these
people.

Responses  to  Items  6 through  40  on  Evaluation  Form  1,  and  Items  5 
through  39  on  Evaluation  Form  2,  form  ordinal  scales  of  measurement. 
Medians  were  therefore  computed  to  describe  the  central  tendency  of 
responses  to  these  items  (Siegel,  1956).  Appendix  D to  this  report 
shows  the  tallies  of  responses  to  Items  6 through  40  on  Form  1,  and 
to  Items  5 through  39  on  Form  2,  as  well  as  the  median  response  for 
each item. Reaction to the CRT Construction Manual was uniformly
favorable. The median responses indicated agreement with favorable
statements about the CRT manual. It should also be noted that, in every
case, the median response was also the modal response.
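
The computation described above can be sketched as follows. This is only an illustration of finding the median and modal category from a tally of ordinal responses; it assumes a four-point scale on which 4 is "strongly agree" (the only anchor named in this section), and the function names and the tallies shown are hypothetical, not data from Appendix D. For an even number of responses the sketch reports the lower of the two middle categories, which is one common convention for ordinal data.

```python
def ordinal_median(tallies):
    """Return the category containing the middle response.

    tallies maps each ordinal category (e.g. 1..4) to its response count.
    With an even number of responses, the lower middle category is returned.
    """
    total = sum(tallies.values())
    if total == 0:
        raise ValueError("no responses to summarize")
    midpoint = (total + 1) // 2          # 1-based position of the middle response
    running = 0
    for category in sorted(tallies):     # walk the scale from low to high
        running += tallies[category]
        if running >= midpoint:
            return category


def ordinal_mode(tallies):
    """Return the most frequently chosen category (lowest one on ties)."""
    if sum(tallies.values()) == 0:
        raise ValueError("no responses to summarize")
    return max(sorted(tallies), key=lambda category: tallies[category])


# Hypothetical tally for one item: 21 responses on a 1-4 scale.
item_tally = {1: 0, 2: 2, 3: 12, 4: 7}
print(ordinal_median(item_tally), ordinal_mode(item_tally))
```

For the hypothetical tally shown, the median and the mode fall in the same category, which is the pattern the text reports for every item.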

Figure 6-1: Field Review Evaluation Forms Returned for Analysis


                                               Quantity of
Facility                                       Evaluation Forms

Fort Knox, Kentucky (Armour School)                  8*
Fort Bliss, Texas (Air Defense School)               7
Fort Ord, California (BCT and AIT Units)             6
Fort Gordon, Georgia (Signal School)                 5
Fort Sill, Oklahoma (Artillery School)               8*
Fort Benning, Georgia (Infantry School)              4

Total:                                              38


*Although only 7 forms were sent to each facility, the Armour School
and Artillery School reproduced copies to permit additional, interested
test construction personnel to respond.


Figure 6-2: Classification of Field Review Evaluation Respondents


                                          Form 1   Form 2   Totals

Instructors
(including senior instructors)
    Civilian                                 3                  3
    Non-Commissioned personnel               5                  5
    Officers                                 4        1         5

Education Specialists
(Post Educational Advisors,
Training Specialists, Education
Counselors and MOS Specialists)
    Civilian                                 5       13        18
    Non-Commissioned personnel
    Officers

Supervisory Personnel
(Branch and Division Chiefs and
Managers)
    Civilian                                 1        2         3
    Officers                                 3        1         4

TOTALS                                      21       17        38



Figure 6-3: Percent* Responses to Q3, Q4, and Q5 on Form 1 (N = 19)


                                                       Response
Question                                       Yes       No     No Response

3.  Prior to reading the CRT Construction      86%     10.5%       4.5%
    Manual, did you know what a CRT
    (criterion-referenced test) was?

4.  Have you ever developed a CRT before?      76%      19%         5%

5.  Have you ever used a CRT developed by      52%      43%         5%
    someone else?

*Rounded to nearest half percent.


Figure 6-4: Percent* Responses to Q3, Q4a, and Q4b on Form 2 (N = 15)


                                                       Response
Question                                       Yes       No     No Response

3.  Have you ever developed (or supervised     82%      12%         6%
    development of) criterion-referenced
    tests (CRTs)?

4a. Have you ever been consulted by someone    82%      12%         6%
    having difficulty in constructing or
    using a CRT?

4b. If so, do you think the CRT manual         82%       0         18%
    would have helped them overcome the
    problem?

*Rounded to nearest half percent.

Respondents using Form 2 agreed strongly with many more items
than did Form 1 respondents, indicating that test evaluators and
educational experts were even more enthusiastic than test constructors
(although both groups were favorably impressed by the manual).
Responses to Item 12 on Form 1 indicate that the majority of respondents
strongly agree that the manual made the distinction between
criterion-referenced and norm-referenced testing clear. On Form 2,
responses to 37 percent of the 35 statements had a median of 4,
strongly agree.

Form 2 respondents especially liked Chapter 1 (Introduction), Chapter 6
(CRT administration and scoring), Chapter 7 (Checking reliability and
validity), and the appendices.

A few individuals in both groups disagreed with some items.*
Consideration of their comments shed light on their points of disagreement.
The problems expressed can be summarized as follows:

1.  The manual does not describe how to derive norm-referenced
rankings from CRTs. Some people are required to rank class
members based on test results. There is no fool-proof way of
ranking students based on CRT results; in fact, the manual
discourages this practice. One respondent suggested giving NRTs
to individuals who successfully complete a course, in order to
determine rankings. This may be a useful suggestion.

2.  How to develop soft-skill objectives was not covered in the
manual.  The  manual  was  not  intended  to  cover  development  of 
objectives  per  se,  only  assessment  of  their  adequacy.  Many 
people  experience  difficulty  in  constructing  soft-skill  CRTs, 
primarily  because  they  do  not  have  proper  objectives  upon 
which  to  base  the  tests. 

3.  The  manual  was  easy  to  use,  but  not  easy  enough.  Individuals 
making  comments  of  this  nature  indicated  that  the  manual  was 
easy  enough  for  them  to  use,  but  that  they  thought  others 
might  experience  difficulty  with  the  level  at  which  the  manual 
was  presented. 

4.  Creating  soft-skill  CRT  items  was  not  covered  in  sufficient 
detail.  Some  respondents  felt  that  a separate  chapter  on 
soft-skill  item  construction  might  be  warranted.  However,  if 
soft-skill  objectives  were  more  explicit,  soft-skill  items 
could be constructed in much the same manner as hard-skill
items.


* About  3.75%  of  the  responses  on  Form  2 were  unfavorable,  and  about  9% 
of  the  responses  on  Form  1 were  unfavorable. 




5.  The item analysis procedure (using φ) is clear, but is probably
impractical for use in the field. Some respondents indicated
that there is rarely enough time, or enough try-out sample
members available, to perform the recommended item analysis
procedure. This may be so, but the procedure recommended is
the simplest empirical item analysis technique practicable.
Test constructors, administrators, and supervisory personnel
should be educated as to the necessity of empirical item
selection procedures, such as the one recommended.

6.  Empirical procedures for determination of test-retest
reliability, and concurrent and predictive validities, are easy to
do, but impractical for field use. This difficulty is essentially
the same as the previous one. Army test constructors
are not accustomed to checking the reliability and validity
of their tests, so the problem of educating them as to the
necessity for these types of test evaluation is even more
pronounced.

7.  The square root tables (Appendix D) do not go up high enough.
The tables go from 1 to 1,000. An explanation should be provided
at the beginning of Appendix D on how to use the tables
to find the square roots of numbers greater than 1,000.

8.  The  manual  is  too  lengthy.  Only  a couple  of  respondents  felt 
the  manual  is  too  long.  Nevertheless,  some  consideration 
should  be  given  to  the  development  of  a condensed  version  of 
the  manual. 

9.  Technical terminology, though kept at a minimum, may conflict
with other terms in use currently. This problem is nearly
insoluble, since there are so many terms in use for the same
concept throughout the military. The manual does, however,
provide a glossary with synonyms.

10. The emphasis on unitary objectives is misleading. It tends
to imply an emphasis on testing at low levels of task integration.
A related problem is that the emphasis in the manual on
responses, rather than on cues (questions), also seems to imply
testing at low levels of task integration.

All levels of task integration were discussed in the
manual. What is a unitary objective at a low level of task
integration might well be a part of a more complex objective
at a higher level of task integration. Similarly, an appropriate
cue at a low level of task integration would probably be
inappropriate at a higher level of task integration.

11. The manual emphasizes full fidelity testing too much. In
addition, it does not stress the importance of "psychological
fidelity." This may be so, but, given the lack of explicit
rules on when, where, and how to reduce the fidelity of tests,
it does not seem appropriate to suggest alternative approaches.

12. The manual is not sufficiently critical of rating scale
techniques. There are many reasons why rating scales have not
worked well in the past, especially those scales which deal
with judgments of global behaviors in work settings. One of
the most important of these reasons is that people are unwilling
to pass judgments on co-workers or subordinates. Since
the manual stresses that, if rating scales are used, they
should be behaviorally anchored, and should be referenced to
discrete, rather than global behaviors, this difficulty is
largely eliminated. Nevertheless, whether or not rating
scales can be used effectively with criterion-referenced
tests of performance-based training is a matter for additional
study.
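
To make the item analysis statistic named in problem 5 concrete, the following is a minimal sketch of the standard phi coefficient for a 2 x 2 table. It is not a reproduction of the manual's procedure, which this report does not restate. The sketch assumes that each try-out examinee is classified into one of two groups and as passing or failing the item; the function name, the group labels, and the counts are hypothetical.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi coefficient for a 2 x 2 table of counts.

                    item passed   item failed
        group 1          a             b
        group 2          c             d
    """
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        # A whole row or column is empty, so the statistic is undefined.
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator


# Hypothetical try-out of one item with 10 examinees in each group:
# 9 of group 1 pass, 2 of group 2 pass.
print(round(phi_coefficient(9, 1, 2, 8), 2))
```

In this hypothetical case the quantity under the square root is 9,900, which is beyond the 1 to 1,000 range of the tables mentioned in problem 7. One identity that extends such a table is that the square root of 100x equals 10 times the square root of x, so the root of 9,900 can be read as 10 times the tabled root of 99.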

The  vast  majority  of  comments  appended  to  the  evaluation  forms  were 
favorable. Out of the 17 test construction personnel who responded to
Item  40  on  Form  1,  14  indicated  that  they  plan  to  use  the  procedures 
presented  in  the  manual  when  constructing  CRTs  in  the  future,  and  many 
appended  comments  reflected  this  enthusiasm.  The  following  comments 
are  representative: 

• "improvement over usual format for such publications . . . language
[is] direct and simple . . . manual is comprehensive."

• "This manual is clear and easy to read . . ."

• ". . . I am an education specialist [and] have been impressed
with the overview I have made . . . would like very much to
have two [additional] copies of the manual."

• "The manual in its present form could be used as a reference
text in a course on test construction conducted at a service
school."

• "The manual is extremely comprehensive and does not appear to
be lacking any necessary information. It is also very clear
and well written. It should be very easy to use."

• ". . . seems to be excellently organized and in clear, precise
terms."

• ". . . comprehensive and extremely well written . . . You are
to be commended on the fine job . . . will become a very significant
and valuable addition to our Army literature on test
design."

• ". . . provides the kind of information a 'how-to' manual should
provide."

Six  individuals  sent  ASA  copies  of  CRTs,  and  associated  materials, 
that  they  created  in  conjunction  with  their  review  of  the  CRT  manual. 
Four  of  these  CRTs  were  in  hard-skill  areas,  and  two  were  in  soft-skill 
areas. Test constructors, who used the manual to guide them in creating
CRTs, achieved impressive results for the most part. Of special
note  was  a soft-skill  CRT  and  supporting  documents  that  comprised  a 
package  of  23  typewritten  pages,  and  was  excellent  in  concept  and 
implementation. 

In addition to comments appended to the evaluation forms, ASA received
many favorable comments from unsolicited sources in both military
and civilian spheres. Nearly 20 such individuals, to whom we did not
directly send the manual, contacted ASA to mention their favorable
impressions of the manual.


Recommendations


The purpose of this section is to present recommendations for
future research on, and implementation of, criterion-referenced
measurement in the Army.


Recommendations  for  Future  CRT  Research  and  Implementation 

1.  A research effort should be conducted to assess the feasibility of
developing and using criterion-referenced MOS tests. Minimally,
this research should encompass the following phases:

A.  Outline procedures for converting existing low fidelity,
norm-referenced MOS tests to higher fidelity, criterion-referenced
MOS tests. These procedures could be based on those presented
in the CRT Construction Manual, Developing Criterion-Referenced
Tests.

B.  Construct criterion-referenced versions of traditional MOS tests
in both hard-skill and soft-skill areas, demonstrating the facility
and cost-effectiveness with which such tests can be
created using the procedures outlined during phase A above.
The criterion-referenced MOS tests should be performance-oriented,
at as high a level of fidelity as is congruent with practicality of
administration and maintenance of adequate objectivity.
Criterion-referencing of MOS tests should render them more
isomorphic to their intended purpose: assessment of individual
performance levels within the occupational specialties.

C.  Compare important psychometric properties of norm-referenced
and criterion-referenced tests via field try-out procedures.
The comparisons made should include test-retest reliability and
both concurrent and predictive validities. In addition, ease
of administration and ease of scoring should be compared for
the traditional and criterion-referenced versions of the MOS
tests. By conducting these various comparisons, cost-benefit
analyses of traditional and criterion-referenced MOS tests, in
both hard-skill and soft-skill areas, can be computed.

2.  Develop more precise objectives for soft-skill areas. A series of
research efforts should address the development of operationally-
defined objectives, amenable to behavioral assessment, in soft-skill
areas. It is inherently more difficult to develop objectives in
soft-skill areas than in hard-skill areas. Although the concept of
criterion-referenced testing is equally applicable to both areas,
the actual establishment of legitimate objectives is more difficult
in areas such as leadership, discipline, tactics, etc., than in
more operational areas, such as M16 assembly/disassembly, first
aid, etc. Although difficult, the development of soft-skill
objectives is certainly possible.

Establishment  of  performance-oriented  objectives  in  soft-skill 
areas  is  frequently  time-consuming  and  tedious.  Real  ingenuity  is 
also  often  required.  But,  although  difficult,  such  objectives  can 
be  created. 

Research efforts should develop and demonstrate specific techniques
for creating adequate soft-skill objectives. These techniques
would find an appreciative audience in the Army, as indicated by
comments received during conduct of the present study.

3.  Implement a program to instruct all Army training- and evaluation-
oriented personnel in Criterion-Referenced Testing. It is apparent
that much attention is devoted to CRT concepts in the Army. Yet
few individuals are actually familiar enough with these concepts
to use them properly. A program should be implemented which will
provide training in Criterion-Referenced Testing for persons at
all levels in the Army hierarchy who are concerned with the
development of procedures for evaluating performance. Special emphasis
should be placed upon the necessity for empirical item selection
procedures (i.e., item analysis) and empirical evaluation of CRTs'
reliability and validity.


4.  Implement  a program  to  train  test  administrators  in  standardization 


for MOS to SQT (Skill Qualification Test--a criterion-referenced
MOS test) conversion and validation. The Enlisted Personnel
Management System (EPMS) conference, held at Fort Benjamin Harrison,
Indiana on 15-17 October, 1974, resulted in the recommendation that
a manual be developed on how to construct and validate criterion-
referenced MOS tests (SQTs). The current manual, Developing
Criterion-Referenced Tests, is appropriate, if merged with the
recently-generated Item Writer's Guide, and modified to be
specifically oriented to MOS tests.

7.  Document procedures for developing soft-skill objectives. As noted
in the discussion on the Field Review of the CRT Construction Manual,
some Army test developers have difficulty in constructing soft-skill
CRTs, primarily because they do not have adequate soft-skill
objectives from which to work.

8.  Develop procedures for using NRTs (or other indices) in conjunction
with CRTs. Many Army test developers are concerned about the
requirement that they provide norm-referenced information on examinees
who have been tested by CRTs. This is a genuine problem with no
easy resolution. Procedures for simultaneously using CRT and other,
norm-referenced indices should be developed for situations requiring
both norm-referenced decisions and criterion-referenced decisions.

9.  Develop  a condensed  version  of  the  CRT  Manual.  A condensed  version 
of  the  CRT  construction  manual  would  be  valuable  for  personnel  who 
are  already  fairly  familiar  with  CR  testing.  This  version  should 
omit  much  of  the  detail  (and  introductory  material)  presented  in 
Developing  Criterion-Referenced  Tests. 
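
Recommendation 1C above calls for comparing the test-retest reliability of the two kinds of tests. As an illustration only, the sketch below computes a conventional test-retest index, the Pearson correlation between scores from two administrations of the same test. This is an assumed, textbook approach rather than the procedure given in the manual, and the function name and the scores are hypothetical.

```python
import math


def pearson_r(first_scores, second_scores):
    """Pearson correlation between paired scores from two administrations."""
    if len(first_scores) != len(second_scores) or len(first_scores) < 2:
        raise ValueError("need at least two paired scores")
    n = len(first_scores)
    mean_first = sum(first_scores) / n
    mean_second = sum(second_scores) / n
    # Sums of cross-products and squared deviations about the means.
    cross = sum((x - mean_first) * (y - mean_second)
                for x, y in zip(first_scores, second_scores))
    ss_first = sum((x - mean_first) ** 2 for x in first_scores)
    ss_second = sum((y - mean_second) ** 2 for y in second_scores)
    if ss_first == 0 or ss_second == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cross / math.sqrt(ss_first * ss_second)


# Hypothetical scores for five examinees tested twice.
test_scores = [12, 15, 9, 18, 14]
retest_scores = [13, 14, 10, 17, 15]
print(round(pearson_r(test_scores, retest_scores), 2))
```

Computing the same index for the norm-referenced and the criterion-referenced version of an MOS test, on comparable try-out samples, would give one basis for the comparison the recommendation describes.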

Interviewer Statement: Now, I would like to discuss with you some tasks that
may be involved in test construction and use. These tasks are done in different
ways in different places; sometimes they are combined, in other cases some
are eliminated. They often go by different names. Would you please tell me
which of these you are involved in.

*4.  Writing  objectives.  That  is--determining  what  the  test  will  measure  and 
the  conditions  under  which  the  measurement  will  occur  in  terms  of  precise, 
behavioral  statements. 

Have  you  been  involved  in  writing  objectives?  Yes No 

If  yes,  (a)  how  long  have  you  been  doing  this?  Years Months 

(b)  do  you  write  objectives  in  operational,  behavioral  terms? 

Yes No Don't  understand 

*5.  Setting standards. That is--defining the standards against which
performance is evaluated. In many cases, these standards are very similar
to the stated objectives.

Have  you  participated  in  setting  standards?  Yes No 

If  yes,  how  long  have  you  been  doing  this?  Years Months 

*6.  Imposing practical constraints. That is--deciding how the test must be
built so it can actually be used within the limits of the situation for
which it is designed. For example, there are often time constraints
involved in testing complex skills.

Have  you  been  involved  in  this?  Yes No 

If  yes,  how  long  have  you  been  doing  this?  Years  Months 

*11.  Measuring reliability. That is--determining if a test will give similar
scores when measuring similar performance. For example, a person taking
equivalent versions of the same test should score about the same on both,
if he has had no practice in between.

Have  you  been  involved  in  measuring  the  reliability  of  tests?  Yes No_ 

If  yes,  (a)  how  long  have  you  been  involved  in  measuring  reliability? 

Years Months 

(b)  do  you  compute  coefficients  of  reliability? 

Yes No Don't  know 

*12.  Evaluating validity. The test developer must determine whether the test
is actually measuring what it is supposed to measure. Personnel who score
high on the test should also perform very well on the task that test is
supposed to measure, while those who score low should not be able to
perform the task as well.

Have  you  helped  to  validate  tests?  Yes No 

If  yes,  (a)  how  long  have  you  been  doing  so?  Years Months 

(b)  do  you  use  content  validity  as  opposed  to  predictive  validity? 

Yes No Don't  know 

13.  Scoring.  How  are  tests  generally  scored?  Are  norms  set  as  standards 
using  bell  shaped  curves,  or  are  "go-no  go"  type  standards  used? 

Norms go- no  go Other 

Interviewer Statement: Now I would like to discuss some of the tasks that
you're involved in.

19.  What inputs do you have available in terms of documents, data, job aids,
field manuals, etc.?  REQUEST THESE


20.  Which of these inputs do you actually use?


*21.  [If  answer  to  20  is  other  than  "all  of  them",  interviewer  asks  #21] 
Why  do  you  use  these  and  not  the  others? 


22.  What  products  do  you  prepare?  REQUEST  THESE 


23.  How  are  these  outputs  used? 


25.  How  did  you  resolve  these  problems? 


*26.  Is  any  special  training  available  for  testing  personnel?  Yes 
If  yes,  please  briefly  describe  this  training? 


27.  What  proportion  of  the  tests  you  have  participated  in  making  or  using  are: 

A.  Paper-and-pencil  knowledge  tests?  

B.  Simulated performance tests?  E.g., using
mockups and drawings

C.  "Hands  on"  performance  tests?  

D.  Other?  Specify:  ___ 


What  proportion  of  the  tests  you  have  participated  in  making  or  using  are 


A.  Specific  skill  and  knowledge  requirements?  

B.  Specialty  areas  in  a course?  

C.  End  of  block  within  a course?  

D.  Mid cycle within a course?

E.  End  of  course?  

*28.  Are  you  familiar  with  any  team  performance  situations  that  were  evaluated 

by  tests?  Yes No 

*29.  Would you briefly describe how tests were used to measure team performance?

30.  Have time pressures, or other constraints, prevented you from successfully
carrying out some of the tasks involved in test construction and use?

Yes No

If yes, describe how you were affected by a constraint.

*31.  Can  you  describe  any  cases  in  which  tests  were  developed  which  were  not 

suitable,  in  your  opinion,  for  the  intended  uses?  Yes No 

Description:

If it is the interviewer's opinion that interviewee
does not understand the distinction between Criterion-
Referenced Testing and norm-referenced testing:


STOP  HERE 


Otherwise  go  on. 

32.  One of the main purposes of our work for the Army is to develop a manual
on how to construct Criterion-Referenced as opposed to Norm-Referenced
Tests. Who will be the primary users of a manual of this type on this
post?


*33.  As  you  know,  in  recent  years  the  Army  has  put  increasing  emphasis  on  using 
Criterion-Referenced  Tests  in  appropriate  testing  situations.  There  is 
still  much  disagreement,  though,  about  what  a Criterion-Referenced  Test 
really  is.  How  is  the  term  "Criterion-Referenced  Test"  used  on  this  post? 


*34.  How  strongly  do  you  feel  about  future  use  of  Criterion-Referenced  Testing 
in  the  Army?  Should  Criterion-Referenced  Test  development  receive  high 
or  low  priority  in  terms  of  Army  assessment  programs? 

Strongly against--Criterion-Referenced Testing should receive bottom
priority, or be dropped entirely.

Against--Criterion-Referenced Testing should receive low priority.

Neutral--Criterion-Referenced Testing should receive average priority.

For--Criterion-Referenced Testing should receive high priority.

Strongly for--Criterion-Referenced Testing should receive top
priority; Criterion-Referenced Tests should replace most or all
norm-referenced tests.

*35.  Do you think cost is a major factor in determining whether Criterion-
Referenced Tests are developed and administered in the Army?  That is--
have you found that Criterion-Referenced Tests are more or less expensive
to develop and administer than conventional, norm-referenced tests?

Less  expensive About  the  same More  expensive 

*36.  Could  you  describe  a situation  in  which  a Criterion-Referenced  Test  was 
found  to  be  prohibitively  expensive  to  develop? 


37.  Do you think that there are any particular advantages or disadvantages to
developing and using Criterion-Referenced Tests in the Army (as opposed
to norm-referenced measures)?  Yes No

What  are  some  advantages  or  disadvantages? 


38.  Are there any special problems you have encountered while developing
or using Criterion-Referenced Tests, as opposed to problems normally
encountered with norm-referenced tests?  Yes No

If  yes,  describe  these  special  problems  and  how  you  overcome  them: 


*39.  How serious are these problems?  That is, how much do they affect the
overall accomplishment of testing objectives?

Field Review Evaluation Packages:
Form 1 and Form 2


DRAFT  LETTER 

(for  use  with  Evaluation  Form  #1) 


Dear Sir,

Applied Science Associates, Inc. (ASA) wishes to solicit your aid
in assessing the enclosed version of a criterion-referenced test construction
manual entitled Developing Criterion-Referenced Tests. This manual,
developed under contract No. DAHC19-74-C-0018 for the Army Research Institute
for the Behavioral and Social Sciences, is intended to aid Army test
developers in the construction of criterion-referenced tests (CRTs). Your
comments and suggestions will be used to help revise this version of the
manual.

Here  is  how  you  can  help: 

1.  Read  the  manual,  familiarizing  yourself  with  its  contents. 

2.  Develop a CRT of your own (for whatever use is appropriate to
your testing needs), following the procedures presented in
the manual. Use the manual step-by-step as you develop this
test.

3.  Fill  out  the  enclosed  evaluation  form  indicating  how  useful 
the  manual  was  in  guiding  you  through  the  test  construction 
process.  Feel  free  to  include  additional  comments  which  you 
think  would  be  helpful  to  us  for  revising  the  manual. 

4.  Send  a copy  of  the  test  you  constructed,  and  associated 
documentation  if  possible,  along  with  the  completed 
evaluation  form  to: 

APPLIED SCIENCE ASSOCIATES, INC.
11800 Sunrise Valley Drive
Reston, VA 22091

If  ASA  receives  your  materials  by  1 November,  1974,  we  will  be  able 
to  consider  your  evaluation  within  the  time  constraints  imposed  by  the 
contract. 

Please  call  ASA  at  (703)  620-3494  if  you  have  any  questions. 

Evaluation Form 1


Please use this evaluation form to indicate how helpful the CRT Manual
was in guiding you through the test construction process.


Name:

Rank  or  Title  First  Middle  Last 

Initial 


Address on Post:

Bldg  & Number 

Street  Address, 

if  applicable 

Post 

City 

State  Zip  Code 

Phone  Number:  

Area Code  Number on Post at
which you can be
reached


1.  What is your position?  [for example: Senior Instructor, Nuclear-
biological-chemical committee]


2.  How long have you been involved with some aspect of test construction?


years  months 

3.  Prior  to  reading  the  CRT  construction  manual,  did  you  know  what  a CRT 
(criterion-referenced  test)  was? 

Yes  No  Circle  one. 

4.  Have  you  ever  developed  a CRT  before?  Yes  No  Circle  one. 

5.  Have  you  ever  used  a CRT  developed  by  someone  else? 

Yes  No  Circle  one. 

THE FOLLOWING STATEMENTS CONCERN MATERIAL IN CHAPTER 3

18.  The concept of practical constraints, and how they may
constrain testing of all objectives as stated, was
adequately presented.

19.  How to overcome practical constraints--either by selecting
among objectives or by modifying objectives in light of
the constraints--was clear.

20.  When  and  how  to  sample  Items  for  objectives  was  easily 
understandable. 

21.  Testing  under  multiple  conditions  and  how  to  sample 
multiple  conditions  was  clear. 

22.  The guidelines for determining how many items to include
in a CRT were helpful.

THE  FOLLOWING  STATEMENTS  CONCERN  MATERIAL  IN  CHAPTER  4 


23.  The  explanation  of  how  to  create  items  based  on  test  plan 
specifications  was  adequate. 

24.  The  material  concerning  developing  specific  and  general 
test  instructions  was  helpful  and  at  the  right  level  of 
detail. 

25.  The section on assessing adequacy of items was clear and
useful.


THE  FOLLOWING  STATEMENTS  CONCERN  MATERIAL  IN  CHAPTER  5 


26.  How  to  select  a proper  try-out  sample  and  conduct  an  item 
pool  try-out  was  clear  and  easy  to  follow. 

27.  Computing item analysis values using φ on try-out results
was presented in a clear fashion.

28.  How  to  reduce  the  item  pool  by  considering  try-out  results, 
item  analysis,  and  item  reviews  was  presented  adequately. 

29.  What  to  do  if  too  few  or  too  many  items  were  left  after 
item  analysis  and  reviews  was  clear. 


THE  FOLLOWING  STATEMENTS  CONCERN  MATERIAL  IN  CHAPTER  6 


30.  The  material  on  standardizing  test  administration  procedures 
and  administering  CRTs  was  clear  and  useful. 

31.  The information on how to score CRTs, establish cut-off
scores, and report test results was adequate.


(Each statement above is rated on the scale  1  2  3  4.)
12  3 4 

THE FOLLOWING STATEMENTS CONCERN MATERIAL IN CHAPTER 7

32.  The  procedure  for  determining  test-retest  reliability  was 
clear  and  easy  to  follow. 

33.  How  to  assess  content  validity  was  clear. 

34.  How  to  determine  concurrent  validity  was  presented  adequately. 

35.  How  to  determine  predictive  validity  was  presented  adequately. 
THE  FOLLOWING  STATEMENTS  CONCERN  MATERIAL  IN  APPENDICES 

36.  Appendix  A (Checklist  for  constructing  CRTs)  was  useful. 

37.  Appendix  B (Checklist  for  evaluating  CRTs)  would  be  helpful 
in  evaluating  CRTs  that  have  already  been  developed. 

38.  Appendix  C (Glossary)  was  helpful  and  contained  all  terms 
I needed  to  look  up. 

39. Appendix D (Square root tables) was useful in calculating
values of φ.

40.  I plan  to  use  the  procedures  presented  in  this  manual  when 
developing  CRTs  in  the  future. 


PLEASE  FEEL  FREE  TO  INCLUDE  ADDITIONAL  COMMENTS  (ATTACH  SEPARATE  SHEETS 
AS  NECESSARY). 


(Response scale printed beside each statement above: 1 2 3 4)

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DRAFT  LETTER 

(for  use  with  Evaluation  Form  #2) 


Dear  Sir, 

Applied Science Associates, Inc. (ASA) wishes to solicit your aid
in assessing the enclosed version of a criterion-referenced test construction
manual entitled Developing Criterion-Referenced Tests. This manual, developed
under contract No. DAHC19-74-C-0018 for the Army Research Institute
for the Behavioral and Social Sciences, is intended to aid Army test developers
in the construction of criterion-referenced tests (CRTs). Its target
audience is composed of senior enlisted personnel and officers who are
involved in test construction, but who may not be sophisticated with respect
to psychometric techniques.

Your  comments  and  suggestions  will  be  used  to  help  revise  this  version 
of the manual.

Here  is  how  you  can  help: 

1. Read the manual.

2. Complete the enclosed evaluation form to evaluate the suitability
of the manual.

3. Feel free to include any additional comments which you think
would be helpful to us for revising the manual.

Please  send  the  completed  evaluation  form  and  any  additional  materials 
to: 


APPLIED  SCIENCE  ASSOCIATES,  INC. 

11800  Sunrise  Valley  Drive 
Reston, VA 22091

In  order  to  be  able  to  use  your  evaluation  within  the  time  constraints 
imposed  by  the  contract,  ASA  must  receive  your  inputs  by  1 November,  1974. 

Please  call  ASA  at  (703)  620-3494  if  you  have  any  questions. 
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Evaluation  Form  2 

Please use this evaluation form to indicate how useful you think the
CRT Manual will be for Army Test Constructors.


Name:
(Rank or Title, First, Middle Initial, Last)

Address on Post:
(Street Address, if applicable; Post; City, State, Zip Code)

Phone Number:
(Area Code; Number on Post at which you can be reached)


1.  What  is  your  position?  [for  example:  Post  Educational  Advisor] 


2.  How  long  have  you  been  involved  with  test  construction  and  related  issues? 


years  months 

3.  Have  you  ever  developed  (or  supervised  development  of)  criterion-referenced 
tests  (CRTs)? 

Yes  No  Circle  one. 

4. Have you ever been consulted by someone having difficulty in constructing
or using a CRT?

Yes  No  Circle  one. 

If  so,  do  you  think  the  CRT  manual  would  have  helped  them  overcome  the 
problem? 

Yes  No  Circle  one. 

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Directions:  The  remainder  of  this  evaluation  form  consists  of  statements 
about the manual, Developing Criterion-Referenced Tests. Each statement
is preceded by the numbers 1 through 4.

Circle  1 if  you  strongly  disagree  with  the  statement. 

Circle  2 if  you  disagree  with  the  statement. 

Circle  3 if  you  agree  with  the  statement. 

Circle  4 if  you  strongly  agree  with  the  statement. 

Please  respond  to  each  statement,  circling  the  number  which  best  expresses 
your  opinion.  Remember  the  manual's  audience  may  be  composed  of  people 
who  are  not  sophisticated  with  respect  to  psychometric  concepts  and 
terminology. 


5. The manual would be very helpful in guiding people through
the CRT construction process.

6.  The  manual  would  be  easy  for  Army  test  developers  to  use. 

7.  Examples  provided  in  the  manual  are  useful. 

8.  The  manual  covers  all  the  points  it  should. 

9.  I would  recommend  that  this  manual  be  used  by  Army  test 
developers  whenever  possible. 

THE FOLLOWING STATEMENTS CONCERN MATERIAL IN CHAPTER 1


10.  The  concept  of  criterion-referenced  testing  is  explained 
clearly. 

11.  The  explanation  of  when  to  develop  CRTs  is  clear  and  accurate. 

12. The distinctions between criterion-referenced and
norm-referenced testing are clear.

13.  The  overview  of  the  CRT  construction  process  provides  a 
clear  idea  of  what  the  manual  covers. 


THE FOLLOWING STATEMENTS CONCERN MATERIAL IN CHAPTER 2

14.  The  discussion  of  the  three  main  parts  of  an  objective  is 
clear  and  comprehensive. 

15. The process of establishing unitary objectives is clear.

16.  The  distinctions  among  overt  main  intents,  covert  main  intents 
and  indicators  are  clear. 

17.  The  sequence  of  operations  for  assessing  the  adequacy  of 
objectives  is  clear  and  to  the  point. 
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THE  FOLLOWING  STATEMENTS  CONCERN  MATERIAL  IN  CHAPTER  3 


18. The concept of practical constraints, and how they may
constrain testing of all objectives as stated, is adequately
presented.

19. The procedures for overcoming practical constraints (either
by selecting among objectives or by modifying objectives
in light of the constraints) are appropriate and presented
clearly.

20.  When  and  how  to  sample  items  for  objectives  is  presented 
adequately. 

21. The information on how to test under multiple conditions
(including how to sample multiple conditions) is appropriate
and clear.

22.  The  guidelines  for  determining  how  many  items  to  include 
in  a CRT  are  helpful. 


THE  FOLLOWING  STATEMENTS  CONCERN  MATERIAL  IN  CHAPTER  4 


23.  How  to  create  items  based  on  test  plan  specifications  is 
explained  adequately. 

24.  The  material  on  developing  specific  and  general  test 
instructions  is  helpful  and  at  the  right  level  of  detail. 

25. The section on assessing the adequacy of items is clear
and useful.


THE  FOLLOWING  STATEMENTS  CONCERN  MATERIAL  IN  CHAPTER  5 


26.  The  procedure  for  selecting  a proper  try-out  sample  and 
conducting  an  item  pool  try-out  is  clear  and  easy  to  follow. 

27. The presentation of how to do an item analysis using φ
is clear and appropriate.

28.  The  material  on  reducing  the  item  pool  by  considering 
try-out  feedback,  item  analysis,  and  item  reviews  is  adequate. 

29. What to do if too few or too many items remain after
reduction of the item pool is clear and appropriate.

THE  FOLLOWING  STATEMENTS  CONCERN  MATERIAL  IN  CHAPTER  6 

30.  The  material  on  standardizing  test  administration  procedures 
and  administering  CRTs  is  clear  and  useful. 

31. How to score CRTs, establish cut-off scores, and report
test results is explained adequately.
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THE  FOLLOWING  STATEMENTS  CONCERN  MATERIAL  IN  CHAPTER  7 

32. The procedure for determining test-retest reliability is
appropriate and presented clearly.

33.  How  to  assess  content  validity  is  clear  and  practical. 

34.  How  to  determine  concurrent  validity  is  appropriate  and 
presented  clearly. 

35.  How  to  determine  predictive  validity  is  appropriate  and 
presented  clearly. 


THE  FOLLOWING  STATEMENTS  CONCERN  MATERIAL  IN  APPENDICES 


36. Appendix A (Checklist for constructing CRTs) is useful.

37. Appendix B (Checklist for evaluating CRTs) is useful.

38.  Appendix  C (Glossary)  is  useful  and  covers  all  necessary 
terms. 

39.  Appendix  D (Square  root  tables)  is  useful  and  appropriate 
for  this  manual. 


PLEASE  FEEL  FREE  TO  INCLUDE  ADDITIONAL  COMMENTS  (ATTACH  SEPARATE  SHEETS 
AS NECESSARY).
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APPENDIX  C 


Cover  Letter  to  Contact  Man  at  Each  Post 
Describing  How  Materials  Are  To  Be 
Distributed 




Dear __________,

In accordance with our recent telephone conversation, enclosed are the
materials you need to help in evaluating the CRT construction manual, Developing
Criterion-Referenced Tests. Seven (7) copies of the manual and seven (7)
field review packages, each consisting of an explanatory cover letter and
an evaluation form, are included.

Please note that four (4) field review packages are labeled "Form 1",
and three (3) are labeled "Form 2". Please distribute the packages as
follows:

1.  Keep  one  copy  of  the  manual  and  one  "Form  2"  field  review  package 
for  yourself.  Follow  the  directions  in  the  cover  letter  enclosed 
in  the  field  review  package. 

2. Select two (2) people on your post who are experienced in test
construction methodology, educational technology, or test evaluation.
Give each a copy of the manual and a "Form 2" field review package.
Ask them to follow the directions in the cover letter.

3. Select four (4) people on your post who are actively involved in
test construction tasks. These may be instructors, senior instructors,
etc. Give each a copy of the manual and a "Form 1" field
review package. Ask them to follow the directions in the cover
letter.

It  is  important  to  remember  that  all  completed  evaluations  must  be 
received  by  Applied  Science  Associates,  Inc.  (ASA)  by  1 November,  1974. 
Consequently, the interval in which evaluations must be completed is
relatively  brief.  To  ensure  meeting  deadlines,  it  is  important  that  you 
distribute  these  materials  as  soon  as  possible. 

If  you  have  questions  concerning  appropriate  candidates  to  receive  the 
field  review  packages  and  manuals,  please  feel  free  to  contact  ASA  at 
(703)  620-3494.  Thank  you  very  much  for  your  cooperation. 

Sincerely, 


Robert  W.  Swezey,  PhD 

Applied  Science  Associates,  Inc. 

Richard B. Pearlstein, PhD
Applied  Science  Associates,  Inc. 


Angelo  Mirabella,  PhD 
Army  Research  Institute 


APPENDIX  D 


Results  of  Field  Review  Evaluation: 
Tallies  of  Responses  on  Form  1 
and  Form  2 
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